world language
Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?
Törö, Tuukka, Suni, Antti, Šimko, Juraj
Investigating linguistic relationships on a global scale requires analyzing diverse features such as syntax, phonology and prosody, which evolve at varying rates influenced by internal diversification, language contact, and sociolinguistic factors. Recent advances in machine learning (ML) offer complementary alternatives to traditional historical and typological approaches. Instead of relying on expert labor to analyze specific linguistic features, these new methods enable the exploration of linguistic variation through embeddings derived directly from speech, opening new avenues for large-scale, data-driven analyses. This study employs embeddings from the fine-tuned XLS-R self-supervised language identification model voxlingua107-xls-r-300m-wav2vec to analyze relationships between 106 world languages based on speech recordings. Using linear discriminant analysis (LDA), language embeddings are clustered and compared with genealogical, lexical, and geographical distances. The results demonstrate that embedding-based distances align closely with traditional measures, effectively capturing both global and local typological patterns. Challenges in visualizing relationships, particularly with hierarchical clustering and network-based methods, highlight the dynamic nature of language change. The findings show potential for scalable analyses of language variation based on speech embeddings, providing new perspectives on relationships among languages. By addressing methodological considerations such as corpus size and latent space dimensionality, this approach opens avenues for studying low-resource languages and bridging macro- and micro-level linguistic variation. Future work aims to extend these methods to underrepresented languages and integrate sociolinguistic variation for a more comprehensive understanding of linguistic diversity.
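The comparison step described in the abstract, checking whether embedding-based distances between languages track a reference distance, can be illustrated with a minimal sketch. It is not the authors' pipeline: it skips the LDA projection and simply averages each language's vectors into a centroid, then rank-correlates pairwise cosine distances with a reference distance table. The language labels, vectors, and reference values below are invented toy data, and all function names are hypothetical.

```python
import math
from itertools import combinations

def centroid(vectors):
    """Mean vector of a list of equal-length embedding vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def compare_distances(embeddings, reference):
    """Spearman correlation between centroid distances and reference distances.

    embeddings: {language: [vector, ...]}
    reference:  {(lang_a, lang_b): distance}, keys in alphabetical order
    """
    cents = {lang: centroid(vecs) for lang, vecs in embeddings.items()}
    pairs = list(combinations(sorted(cents), 2))
    emb = [cosine_distance(cents[a], cents[b]) for a, b in pairs]
    ref = [reference[pair] for pair in pairs]
    return pearson(ranks(emb), ranks(ref))

# Toy data (invented): two nearby "languages" and one distant one.
toy_embeddings = {
    "lang_a": [[1.0, 0.1], [0.9, 0.2]],
    "lang_b": [[0.9, 0.3], [0.8, 0.4]],
    "lang_c": [[0.1, 1.0], [0.2, 0.9]],
}
toy_reference = {
    ("lang_a", "lang_b"): 0.2,
    ("lang_a", "lang_c"): 1.0,
    ("lang_b", "lang_c"): 0.9,
}
rho = compare_distances(toy_embeddings, toy_reference)
```

A rank correlation is used here because embedding distances and genealogical or geographical distances live on different scales. A real analysis over many languages would typically also need a permutation test, since pairwise distances are not independent observations.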
New study tests machine learning on detection of borrowed words in world languages
Lexical borrowing is very widespread and may affect even those words that play an important role in our daily life. English 'mountain', for example, was borrowed from Old French, along with many other words. Researchers from the Pontificia Universidad Católica del Perú and the Max Planck Institute for the Science of Human History have investigated the ability of machine learning algorithms to identify lexical borrowings using word lists from a single language. Results published in the journal PLOS ONE show that current machine-learning methods alone are insufficient for borrowing detection, confirming that additional data and expert knowledge are needed to tackle one of historical linguistics' most pressing challenges. Lexical borrowing, or the direct transfer of words from one language to another, has interested scholars for millennia, as evidenced in Plato's Kratylos dialog, in which Socrates discusses the challenge imposed by borrowed words on etymological studies.
In historical linguistics, lexical borrowings help researchers trace the evolution of modern languages and indicate cultural contact between distinct linguistic groups, whether recent or ancient. However, the techniques for identifying borrowed words have resisted formalization, demanding that researchers rely on a variety of proxy information and the comparison of multiple languages. "The automated detection of lexical borrowings is still one of the most difficult tasks we face in computational historical linguistics," says Johann-Mattis List, who led the study. In the current study, researchers from PUCP and MPI-SHH employed different machine learning techniques to train language models that mimic the way in which linguists identify borrowings when considering only the evidence provided by a single language: if sounds, or the ways in which sounds combine to form words, are atypical compared with other words in the same language, this often hints at recent borrowing.
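The single-language idea described above, scoring how atypical a word's sound sequence is relative to the rest of the vocabulary, can be sketched with a very small sound-bigram model. This is only an illustration of the general principle and not the models used in the study. The class name, the toy word list, and the use of add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

BOUNDARY = "#"  # marks the start and end of a word

def bigrams(word):
    """Adjacent symbol pairs of a word, including word boundaries."""
    padded = [BOUNDARY] + list(word) + [BOUNDARY]
    return list(zip(padded, padded[1:]))

class BigramModel:
    """Sound-bigram model with add-one smoothing (hypothetical sketch)."""

    def __init__(self, words):
        self.pair_counts = Counter()
        self.context_counts = Counter()
        symbols = {BOUNDARY}
        for word in words:
            symbols.update(word)
            for prev, cur in bigrams(word):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1
        self.vocab_size = len(symbols)

    def surprisal(self, word):
        """Mean negative log2 probability per transition in the word.

        Add-one smoothing keeps unseen transitions from getting
        zero probability.
        """
        pairs = bigrams(word)
        total = 0.0
        for prev, cur in pairs:
            prob = (self.pair_counts[(prev, cur)] + 1) / (
                self.context_counts[prev] + self.vocab_size
            )
            total -= math.log2(prob)
        return total / len(pairs)

# Toy vocabulary (invented): every word follows a consonant-vowel pattern.
toy_words = ["tata", "tana", "nata", "mana", "tama", "nama"]
model = BigramModel(toy_words)

typical_score = model.surprisal("mata")   # fits the observed pattern
atypical_score = model.surprisal("zgri")  # unfamiliar sounds and clusters
```

A higher score means the word looks less like the rest of the vocabulary. In practice one would flag words above some threshold as borrowing candidates; as the study's results indicate, such single-language evidence alone is not sufficient for reliable detection.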
Amid coronavirus, students flock to Kahoot! and Duolingo. Is it the end of language teachers?
Every day, Massachusetts seventh-grader Kaylyn Wilson takes a break from doing homework online and opens an app on her phone for a half-hour foreign language lesson. "The boy has three green bikes and an egg," the 12-year-old announced to her family in French at the start of her third week using the mobile app from Rosetta Stone, the language-learning software giant. Wilson doesn't yet need to study a language for credit. But during the school shutdowns to contain the coronavirus, her father saw Rosetta Stone advertise free accounts for students – an offer other language-learning software companies have made as well. Wilson decided to give it a go.